9 research outputs found

Speakers Fill Lexical Semantic Gaps with Context

Author: Blasi Damián
Cotterell Ryan
Maudslay Rowan Hall
Pimentel Tiago
Publication date: 01/11/2020

Lexical ambiguity is widespread in language, allowing for the reuse of economical word forms and therefore making language more efficient. If ambiguous words cannot be disambiguated from context, however, this gain in efficiency might make language less clear -- resulting in frequent miscommunication. For a language to be clear and efficiently encoded, we posit that the lexical ambiguity of a word type should correlate with how much information context provides about it, on average. To investigate whether this is the case, we operationalise the lexical ambiguity of a word as the entropy of meanings it can take, and provide two ways to estimate this -- one which requires human annotation (using WordNet), and one which does not (using BERT), making it readily applicable to a large number of languages. We validate these measures by showing that, on six high-resource languages, there are significant Pearson correlations between our BERT-based estimate of ambiguity and the number of synonyms a word has in WordNet (e.g. ρ = 0.40 in English). We then test our main hypothesis -- that a word's lexical ambiguity should negatively correlate with its contextual uncertainty -- and find significant correlations on all 18 typologically diverse languages we analyse. This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.

Comment: Camera ready version of EMNLP 2020 publication. Code is available in https://github.com/tpimentelms/lexical-ambiguity-in-contex
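
As a rough illustration of the abstract's idea of ambiguity as "the entropy of meanings", the sketch below computes Shannon entropy over a list of sense labels. It is not the authors' estimator, which relies on WordNet or BERT. The function name `sense_entropy` and the hand-labelled toy senses are assumptions made for this example only.

```python
import math
from collections import Counter


def sense_entropy(sense_labels):
    """Shannon entropy (in bits) of the empirical sense distribution.

    `sense_labels` is a hypothetical list with one sense label per
    observed occurrence of a word; it is not a format from the paper.
    """
    counts = Counter(sense_labels)
    total = sum(counts.values())
    # Sum -p * log2(p) over the observed senses.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data (invented): a word seen equally often in two senses, and a
# word only ever seen in one sense.
two_senses = ["finance", "river", "finance", "river"]
one_sense = ["fruit", "fruit", "fruit"]

print(sense_entropy(two_senses))
print(sense_entropy(one_sense))
```

Under this toy measure a word whose occurrences are split evenly between two senses scores one bit, and a word with a single sense scores zero.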

arXiv.org e-Print Archive

Repository for Publications and Research Data

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

UniMorph 4.0: Universal Morphology

Author: Aiton Grant
Anastasopoulos Antonios
Andrushko Taras
Angulo Candy
Arora Aryaman
Ataman Duygu
Ate Yustinus Ghanggo
Batsuren Khuyagbaatar
Bautista Juan López
Baxi Jatayu
Bayyr-ool Aziyana
Bella Gábor
Bernardy Jean-Philippe
Bhatt Brijesh
Budianskaya Elena
Camaiteri Delio Siticonatzi
Chodroff Eleanor
Coler Matt
Cotterell Ryan
Cruz Hilaria
Czarnowska Paula
Dirix Peter
Dolatian Hossep
Ek Adam
El-Khaissi Charbel
Francis Didier López
Ganieva Sofya
Gasser Michael
Giunchiglia Fausto
Goldman Omer
Gorman Kyle
Guriel David
Habash Nizar
Hatcher Richard J.
Hennigen Lucas Torroba
Hulden Mans
Ivanova Sardana
Karahóǧa Ritván
Khalifa Salam
Kieraś Witold
Klyachko Elena
Krizhanovskaya Natalia
Krizhanovsky Andrew
Kumar Ritesh
Lane William
Leonard Brian
Liu Zoey
Marchenko Igor
Markantonatou Stella
Mashkovtseva Polina
Maudslay Rowan Hall
McCarthy Arya D.
Mielke Sabrina J.
Nepomniashchaya Maria
Nicolai Garrett
Nikkarinen Irene
Nuriah Zahroh
Oncevay Arturo
Pavlidis George
Pimentel Tiago
Pinter Yuval
Plugaryov Matvey
Ponti Edoardo M.
Prud'hommeaux Emily
Raj Mohit
Ratan Shyam
Rodionova Daria
Rojas Esaú Zumaeta
Ryskina Maria
Salchak Aelita
Salehi Ali
Salesky Elizabeth
Samame Jaime Rafael Montoya
Scherbakov Andrey
Serova Alexandra
Sheifer Karina
Silfverberg Miikka
Stoehr Niklas
Straughn Christopher
Suhardijanto Totok
Tsarfaty Reut
Tyers Francis M.
Valvoda Josef
Vania Clara
Villegas Gema Celeste Silva
Vylomova Ekaterina
Washington Jonathan North
White Jennifer
Wolinski Marcin
Yablonskaya Anna
Yarowsky David
Yemelina Anastasia
Young Jeremiah
Zariquiey Roberto
Zmigrod Ran
Publication venue: 'Center for Open Science'
Publication date: 07/05/2022

ARTS repository - University of Groningen

Do Syntactic Probes Probe Syntax? Experiments with Jabberwocky Probing

Author: Cotterell Ryan
Hall Maudslay Rowan
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/06/2021

Analysing whether neural language models encode linguistic information has become popular in NLP. One method of doing so, which is frequently cited to support the claim that models like BERT encode syntax, is called probing; probes are small supervised models trained to extract linguistic information from another model’s output. If a probe is able to predict a particular structure, it is argued that the model whose output it is trained on must have implicitly learnt to encode it. However, drawing a generalisation about a model’s linguistic knowledge about a specific phenomenon based on what a probe is able to learn may be problematic: in this work, we show that semantic cues in training data mean that syntactic probes do not properly isolate syntax. We generate a new corpus of semantically nonsensical but syntactically well-formed Jabberwocky sentences, which we use to evaluate two probes trained on normal data. We train the probes on several popular language models (BERT, GPT-2, and RoBERTa), and find that in all settings they perform worse when evaluated on these data, for one probe by an average of 15.4 UUAS points absolute. Although in most cases they still outperform the baselines, their lead is reduced substantially, e.g. by 53% in the case of BERT for one probe. This begs the question: what empirical scores constitute knowing syntax?
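
For readers unfamiliar with the UUAS metric mentioned in the abstract, the sketch below shows one common reading of it: the share of gold dependency edges that also appear in the predicted tree, ignoring edge direction and labels. The function name `uuas`, the per-sentence scope, and the edge-pair input format are assumptions for illustration; the paper's own evaluation code may aggregate or filter edges differently.

```python
def uuas(gold_edges, predicted_edges):
    """Undirected unlabeled attachment score for a single sentence.

    Each edge is a pair of token indices. Direction is ignored by
    storing every edge as a frozenset. (Hypothetical helper, not taken
    from the paper's code.)
    """
    gold = {frozenset(edge) for edge in gold_edges}
    pred = {frozenset(edge) for edge in predicted_edges}
    if not gold:
        return 0.0
    # Fraction of gold edges recovered by the prediction.
    return len(gold & pred) / len(gold)


# Toy example (invented): one of the two gold edges is recovered, with
# its direction reversed, which still counts as a match.
gold = [(0, 1), (1, 2)]
predicted = [(1, 0), (2, 3)]
print(uuas(gold, predicted))
```

Multiplying the score by 100 gives the "points" scale used when the abstract reports a drop of 15.4 UUAS points.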

arXiv.org e-Print Archive

Repository for Publications and Research Data

Speakers Fill Lexical Semantic Gaps with Context

Author: Blasi Damián
Cotterell Ryan
Hall Maudslay Rowan
Pimentel Tiago
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/11/2020

Lexical ambiguity is widespread in language, allowing for the reuse of economical word forms and therefore making language more efficient. If ambiguous words cannot be disambiguated from context, however, this gain in efficiency might make language less clear -- resulting in frequent miscommunication. For a language to be clear and efficiently encoded, we posit that the lexical ambiguity of a word type should correlate with how much information context provides about it, on average. To investigate whether this is the case, we operationalise the lexical ambiguity of a word as the entropy of meanings it can take, and provide two ways to estimate this -- one which requires human annotation (using WordNet), and one which does not (using BERT), making it readily applicable to a large number of languages. We validate these measures by showing that, on six high-resource languages, there are significant Pearson correlations between our BERT-based estimate of ambiguity and the number of synonyms a word has in WordNet (e.g. ρ = 0.40 in English). We then test our main hypothesis -- that a word’s lexical ambiguity should negatively correlate with its contextual uncertainty -- and find significant correlations on all 18 typologically diverse languages we analyse. This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.

Repository for Publications and Research Data

Metaphor Detection Using Context and Concreteness

Author: Cotterell Ryan
Maudslay Rowan Hall
Pimentel Tiago
Teufel Simone
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/07/2020

We report the results of our system on the Metaphor Detection Shared Task at the Second Workshop on Figurative Language Processing 2020. Our model is an ensemble, utilising contextualised and static distributional semantic representations, along with word-type concreteness ratings. Using these features, it predicts word metaphoricity with a deep multi-layer perceptron. We are able to best the state-of-the-art from the 2018 Shared Task by an average of 8.0% F1, and finish fourth in both sub-tasks in which we participate.

Repository for Publications and Research Data

A Tale of a Probe and a Parser

Author: Cotterell Ryan
Hall Maudslay Rowan
Pimentel Tiago
Valvoda Josef
Williams Adina
Publication venue: Association for Computational Linguistics
Publication date: 01/01/2020

Measuring what linguistic information is encoded in neural models of language has become popular in NLP. Researchers approach this enterprise by training “probes” -- supervised models designed to extract linguistic structure from another model’s output. One such probe is the structural probe (Hewitt and Manning, 2019), designed to quantify the extent to which syntactic information is encoded in contextualised word representations. The structural probe has a novel design, unattested in the parsing literature, the precise benefit of which is not immediately obvious. To explore whether syntactic probes would do better to make use of existing techniques, we compare the structural probe to a more traditional parser with an identical lightweight parameterisation. The parser outperforms the structural probe on UUAS in seven of nine analysed languages, often by a substantial amount (e.g. by 11.1 points in English). Under a second less common metric, however, there is the opposite trend -- the structural probe outperforms the parser. This begs the question: which metric should we prefer?

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref